We study statistical properties of the DNA chains of thirteen
complete microbial genomes. We find that the power spectrum
of several of the sequences studied flattens off in the
low-frequency limit. This implies that the correlation length
in those sequences is much smaller than the entire DNA chain.
Consequently, in contradiction with previous studies, we show
that the fractal behavior of DNA chains does not always
prevail through the entire DNA molecule.



I. INTRODUCTION 

The statistics of DNA sequences is an active topic of
research. There are studies of the power spectral density,
the random-walk representation, the correlation function,
etc. Although some of these studies contradict each other,
there is a consensus about the reported behavior of the
power spectrum of DNA sequences. At high frequencies it is
roughly flat, with a sharp peak at $f = 1/3$, which has been
shown to be due to nonuniform codon usage. At lower
frequencies, it has been reported to follow a power-law
with exponent approximately equal to -1, that is, $1/f$
noise. Since the power-law is cut off at high frequencies,
this has been called a "partial power-law". The presence of
$1/f$ noise in a given frequency interval indicates a
self-similar (fractal) structure in the corresponding range
of wavelengths, whereas a flat power spectrum indicates the
absence of correlations (white noise).

An important question is whether the power-law behavior of
the power spectrum of a given DNA chain extends down to the
smallest frequencies. If it does, the fractal behavior of
that DNA chain spans the entire chain, and the correlation
length of the chain is not smaller than the chain size.
Some studies have claimed that the fractal behavior of DNA
prevails through the entire DNA molecule. The aim of this
paper is to show that this is not generally correct.

We have carried out a statistical analysis of the DNA of
thirteen complete microbial genomes, namely Archaeoglobus
fulgidus (2178400 bp), Aquifex aeolicus (1551335 bp),
Bacillus subtilis (4214814 bp), Chlamydia trachomatis
(1042519 bp), Escherichia coli, also known as Ecoli
(4639221 bp), Treponema pallidum (1138011 bp), Haemophilus
influenzae Rd (1830138 bp), Helicobacter pylori 26695
(1667867 bp), Mycoplasma pneumoniae (816394 bp),
Mycobacterium tuberculosis H37Rv (4411529 bp), Pyrococcus
horikoshii OT3 (Pyro-h) (1738505 bp), Synechocystis PCC6803
(3573470 bp), and Mycoplasma genitalium G37 (580073 bp).
We have found
that the behavior of the power spectrum at small frequencies
can differ from organism to organism, and also between
different nucleotides in the same organism. Thus, for some
organisms, the power spectrum (PS) as a function of the
frequency shows, in a log-log plot, three different regions
instead of the two reported previously. That is, as the
frequency increases, it changes from (on average) a flat
function to a power-law and then to a flat function again,
showing that the fractal structure of DNA sequences does not
necessarily extend up to the total length of the chain. The
flattening of the power spectrum at low frequencies is a
signature of the fact that the correlation length of DNA
sequences is, for many sequences, much smaller than the
entire length of the DNA chain. We have also calculated the
autocorrelation function (AF) of the nucleotides in the DNA
chains of the organisms mentioned above. We have found that
in some of the organisms the correlation length is of the
order of a few thousand base pairs. In others, the
correlation length is very large, not smaller than 100,000
base pairs.

A DNA chain is represented by a sequence of four letters,
corresponding to the four different nucleotides: adenine
(A), cytosine (C), guanine (G) and thymine (T). The
calculation of the power spectrum or the autocorrelation
function requires that this symbolic sequence be
transformed into a numerical one. Several methods have been
proposed for this. Here we use the method introduced by
Voss, which has been shown in [10] to be equivalent to the
method used in other studies. In Voss's method one
associates 0 with a site at which a given symbol is absent
and 1 with a site at which it is present. So, for a given
DNA sequence there are four different numerical sequences,
associated with A, C, G and T. In his original paper, Voss
calculated the PS for each of these sequences and summed
them to find the average PS. Here we treat them separately,
because we also want to know about the similarities and
differences between the statistical features of different
nucleotides in a given DNA sequence.

By artificially linking flanking sequences together,
Borstnik et al. [11] found a PS that, as a function of the
frequency, was flat, then decayed exponentially, then was
flat again. Our studies of complete sequences show that the
PS does not show any exponential decay in the region of
intermediate frequencies. We find instead a power-law.
However, at low frequencies we also find a flat PS in
several of the sequences studied. A flat PS at high
frequencies is observed in all cases.

II. STATISTICAL ANALYSIS 
A. Power Spectrum 

Let us use Voss's method and denote by $x_j^A$ the
numerical value associated with the symbol A at location j.
Then one has $x_j^A = 1$ if symbol A is present at location
j and $x_j^A = 0$ otherwise. A similar transformation is
made for symbols C, G and T. Consequently, the DNA can be
divided into four different binary subsequences of 0's and
1's, associated with the symbols A, C, G, and T.
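
The mapping just described can be sketched in a few lines of
Python. This is only an illustration under stated assumptions,
not the code used for the paper: the function name
`voss_indicators` and the ten-letter toy sequence are
hypothetical.

```python
def voss_indicators(sequence, alphabet="ACGT"):
    """Map a DNA string to one binary (0/1) list per nucleotide symbol."""
    sequence = sequence.upper()
    return {
        symbol: [1 if base == symbol else 0 for base in sequence]
        for symbol in alphabet
    }


# Hypothetical toy sequence, used only for illustration (not genome data).
toy_sequence = "ATGCGATACG"
indicators = voss_indicators(toy_sequence)

print(indicators["A"])  # 1 where A is present, 0 elsewhere
```

Since every site holds exactly one nucleotide, the four
indicator sequences sum to 1 at each location.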

The Fourier transform of a numerical sequence $x_k$ of
length N is, by definition,

$$V(f_j) = \frac{1}{N} \sum_{k=0}^{N-1} x_k \exp(-2\pi i k f_j), \tag{1}$$

where the frequency $f_j$ is given by $f_j = j/N$, and
$j = 0, \ldots, N-1$. The PS is defined as
$S(f_j) = V(f_j) V(f_j)^* = |V(f_j)|^2$. From the
definition, we can see that $V(f_0) = \langle x_k \rangle$,
where the brackets denote the average along the chain.
Consequently, this quantity carries no information about
the relative positions of the nucleotides. Because of this,
we neglect this quantity in our calculations, that is, we
concentrate only on frequencies with $j > 0$.
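
As a hedged illustration of Eq. (1) and of the definition of
the PS, the sketch below evaluates the transform directly,
assuming the 1/N normalization implied by
$V(f_0) = \langle x_k \rangle$. The function names and the
toy indicator sequence are hypothetical, and the direct sum
costs of order $N^2$ operations, so it is suitable only for
short inputs.

```python
import cmath


def fourier_transform(x):
    """V(f_j) = (1/N) * sum_k x_k * exp(-2*pi*i*k*j/N) for j = 0..N-1.

    Direct O(N^2) evaluation, adequate only for short illustrative inputs.
    """
    n = len(x)
    return [
        sum(x[k] * cmath.exp(-2j * cmath.pi * k * j / n) for k in range(n)) / n
        for j in range(n)
    ]


def power_spectrum(x):
    """S(f_j) = V(f_j) * conj(V(f_j)) = |V(f_j)|^2."""
    return [abs(v) ** 2 for v in fourier_transform(x)]


# Hypothetical indicator sequence with the symbol at every third site.
x = [1, 0, 0] * 4
v = fourier_transform(x)
s = power_spectrum(x)
mean_x = sum(x) / len(x)

# Index of the largest PS value among the frequencies 0 < f_j <= 0.5.
peak_j = max(range(1, len(x) // 2 + 1), key=lambda j: s[j])
print(abs(v[0] - mean_x) < 1e-12)  # V(f_0) equals the mean of x_k
print(peak_j / len(x))             # frequency of the peak
```

For this strictly period-3 toy input all the power at
$j > 0$ sits at $f = 1/3$ and its mirror image $f = 2/3$.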

Since DNA sequences have a large number of base pairs, and
the PS fluctuates considerably, some kind of averaging is
usually done before plotting this quantity as a function of
the frequency. The averaging used so far is the following:
the DNA chain of length N is divided into non-overlapping
subsequences of length L. Then the power spectrum of each
of these segments is computed and averaged over the N/L
subsequences. In this method the smallest frequency for
which the PS can be calculated is, of course, $f = 1/L$.
Consequently, the behavior of the PS at frequencies in the
range [1/N, 1/L] is unknown. An example of such a
calculation for Ecoli is shown in Fig. 1, where the DNA
chain was divided into subsequences of 8192 nucleotides. A
clear power-law, followed by an approximately flat region
with a sharp peak at $f = 1/3$, is seen. To avoid overlap of
the curves, we have displaced the PS of cytosine, guanine
and thymine by dividing them by 10, $10^2$ and $10^3$,
respectively. Since the power spectrum of a sequence of
real numbers is symmetric with respect to the axis
$f = 0.5$, we plot the PS only for frequencies in the
interval from 0 to 0.5. A similar figure for adenine has
been published previously.
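
The segment averaging described above can be sketched as
follows. This is a minimal illustration, not the code behind
Fig. 1: the function names are hypothetical, the PS is
evaluated by a direct sum that is practical only for short
inputs, and dropping a trailing partial segment is an
assumption made here.

```python
import cmath


def power_spectrum(x):
    """|V(f_j)|^2 with V(f_j) = (1/N) sum_k x_k exp(-2 pi i k j / N)."""
    n = len(x)
    return [
        abs(sum(x[k] * cmath.exp(-2j * cmath.pi * k * j / n)
                for k in range(n)) / n) ** 2
        for j in range(n)
    ]


def segment_averaged_ps(x, seg_len):
    """Average the PS over the non-overlapping segments of length seg_len."""
    n_segments = len(x) // seg_len  # assumption: a trailing partial segment is dropped
    if n_segments == 0:
        raise ValueError("sequence shorter than one segment")
    total = [0.0] * seg_len
    for i in range(n_segments):
        segment = x[i * seg_len:(i + 1) * seg_len]
        for j, value in enumerate(power_spectrum(segment)):
            total[j] += value
    return [t / n_segments for t in total]


# Hypothetical short indicator sequence (the paper uses segments of 8192 sites).
x = [1, 0, 0, 1, 1, 0] * 8  # N = 48
averaged = segment_averaged_ps(x, seg_len=12)
lowest_frequency = 1 / 12  # nothing between 1/N and 1/L is resolved
```

The averaged spectrum has only L points, which is why this
approach cannot say anything about frequencies below 1/L.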

In this paper we show that another way of averaging allows
one to calculate the PS at smaller frequencies than the
method described above, and then to verify what happens to
the power-law as the frequency decreases. More specifically,
we calculate the mean PS in a sliding window of n points,
with adjacent windows having an overlap of n - 1 points.
The average PS in each window determines the values of the
resulting smoothed sequence. In mathematical terms we can
express this as

$$\bar{S}(f_j) = \frac{1}{n} \sum_{m=j-\Delta}^{j+\Delta} S(f_m), \tag{2}$$

where $\Delta = (n - 1)/2$, n is taken to be an odd number,
and j varies from $\Delta + 1$ to $N - \Delta - 1$. Although
the new sequence in this method is smoother than the
original one, it is shorter than it by only $2\Delta$
points. We have found that this method shows the same
behavior at moderate and high frequencies as the averaging
over non-overlapping segments described above. However, it
is much superior for studies at low frequencies.
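
A minimal sketch of the sliding-window average of Eq. (2) is
given below, assuming the input list already excludes the
zero-frequency term so that its first entry is $S(f_1)$. The
function name and the five sample values are hypothetical.

```python
def sliding_window_mean(spectrum, n):
    """Smooth S(f_j), j >= 1, with a centered window of n points (n odd).

    spectrum[0] is taken to be S(f_1); the zero-frequency term is assumed to
    have been dropped already. The output is shorter by 2 * delta points.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    delta = (n - 1) // 2
    prefix = [0.0]
    for value in spectrum:  # running sums make each window O(1)
        prefix.append(prefix[-1] + value)
    return [
        (prefix[j + delta + 1] - prefix[j - delta]) / n
        for j in range(delta, len(spectrum) - delta)
    ]


raw = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical values, not a real spectrum
smoothed = sliding_window_mean(raw, n=3)
print(smoothed)
```

Because consecutive windows share n - 1 points, the smoothed
sequence keeps the full frequency resolution 1/N and loses
only $\Delta$ points at each end.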

To speed up the calculation of the PS we have used, as is
normally done, the Fast Fourier Transform algorithm [12].
This algorithm speeds up the calculation of the PS by a
factor of $N/\log_a N$, but it requires that the length of
the sequence analyzed be an integer power of the integer a,
which is usually taken to be two. Since the lengths of DNA
sequences are not generally equal to an integer power of
two, we take in our computation the largest subsequence,
starting from the beginning of the chain, that fulfills
this requirement. More specifically, we take the first
$N' = 2^K$ nucleotides, where K is the largest integer
satisfying $N' \le N$, with N being the total size of the
DNA chain. In this way, the number of nucleotides not
included in the calculation is always smaller, and in many
cases much smaller, than N/2. We have also done
calculations considering the entire length of the DNA and
zero padding the sequence to the next integer power of 2.
The results remain essentially the same as the ones we show
here.
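
The truncation rule can be checked with a short sketch. The
genome lengths are the ones quoted in the Introduction; the
function names are hypothetical and the code is only an
illustration of the arithmetic.

```python
def largest_power_of_two_at_most(n):
    """Return N' = 2**K with K the largest integer such that 2**K <= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n.bit_length() - 1)


def next_power_of_two(n):
    """Return the smallest power of two >= n (target length for zero padding)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


# Chain lengths in base pairs, as listed in the Introduction.
genome_lengths = {
    "Ecoli": 4639221,
    "Aquifex aeolicus": 1551335,
    "Bacillus subtilis": 4214814,
    "Haemophilus influenzae Rd": 1830138,
}

for name, n in genome_lengths.items():
    n_used = largest_power_of_two_at_most(n)
    print(f"{name}: N' = {n_used}, {100 * n_used / n:.1f}% of the chain")
```

By construction $N' > N/2$, so fewer than N/2 sites are ever
discarded.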

Since our method shows the same behavior for the PS in the
range of intermediate and large frequencies as the other
averaging method, and also because of the large size of the
DNA chains, we plot the PS only in the frequency range
[1/N, 0.01]. We show in Fig. 2 the results of our
calculation with n = 33 for four representative cases out
of the thirteen studied. For clarity, and to avoid overlap
of the curves, we have displaced the PS of C, G and T by
dividing them by 10, $10^2$ and $10^3$, respectively. Our
results show that the low-frequency PS associated with each
of the nucleotides in the organisms studied falls into one
of the following cases:

(a) All four PS associated with the four different
nucleotides flatten off at low frequencies. In these cases
there are three regions in the PS versus frequency curve.
At both low and high frequencies the PS is of white-noise
type, and the middle region is characterized approximately
by a power-law behavior, that is, in a log-log plot the PS
satisfies $S \sim f^{-\gamma}$, with $\gamma > 0$. This is,
for example, the case of Ecoli, shown in Fig. 2(a).
Comparison with Fig. 1, or with the similar figure published
previously, shows that the averaging over non-overlapping
segments does not reveal the true behavior of the PS at low
frequencies. In this calculation we used the first $2^{22}$
nucleotides, which corresponds to 90% of the Ecoli DNA. We
show another case with the same behavior in Fig. 2(b),
which is the PS of Aquifex aeolicus. For the PS of Aquifex
aeolicus we used the first $2^{20}$ sites, which
corresponds to 68% of the chain length. The other
organisms, among the ones studied, that show the same PS
behavior are Archaeoglobus fulgidus, Synechocystis PCC6803,
Mycoplasma pneumoniae, and Mycobacterium tuberculosis.

(b) The second type of behavior is the one in which the PS
of all the nucleotides follows a power-law at small
frequencies, approximately an extension of its behavior at
intermediate frequencies. For these organisms, the PS
presents only two regions: a flat one at high frequencies,
and a power-law behavior at intermediate and low
frequencies. A typical case of this kind of behavior is
shown in Fig. 2(c), which is the PS of Bacillus subtilis.
In the calculation of the PS in this case we have used the
first $2^{22}$ sites, which corresponds to 99% of the total
length of the chain. The other organisms studied that have
a similar PS are Treponema pallidum, Pyrococcus horikoshii
OT3 (Pyro-h), and Mycoplasma genitalium.

(c) The third, and last, type of behavior we have seen is
the one in which, for a given organism, different
nucleotides present different asymptotic behavior of the PS
at low frequencies. That is, the PS flattens off for some
of the nucleotide sequences, and for the others it remains
approximately a power-law. An example of such behavior is
shown in Fig. 2(d), which is the PS of Haemophilus
influenzae Rd. We see that the different behaviors of the
PS are grouped in pairs. In all the cases studied we found
that the PS of A is qualitatively similar to the PS of T,
and that of C is similar to that of G. This kind of pairing
of the statistical features of nucleotides has been
reported for yeast chromosomes. It is probably caused by
the strand symmetry of DNA sequences reported previously.
In the calculation of the PS we have used the first
$2^{20}$ sites of the DNA chain, which corresponds to 57%
of the total number of nucleotides. Since a large number of
sites are left out of the calculation, we have also
analyzed the PS of the central and final regions of the
chain. We verified that the results remain essentially the
same as the ones shown in Fig. 2(d). The other organisms
that have similar statistical features for the PS are
Chlamydia trachomatis and Helicobacter pylori 26695.
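
The three cases above are distinguished by whether the PS
follows $S \sim f^{-\gamma}$ or is flat in a given frequency
range. One simple way to quantify this, offered here only as
an illustrative sketch and not as the procedure used in the
paper, is a least-squares line fit in log-log coordinates.
The function name and the synthetic spectra are hypothetical.

```python
import math


def fit_power_law_exponent(frequencies, spectrum):
    """Return gamma from a least-squares line fit of log S against log f."""
    xs = [math.log(f) for f in frequencies]
    ys = [math.log(s) for s in spectrum]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # S ~ f^(-gamma), so gamma is minus the slope


# Synthetic spectra, not measured data: an exact 1/f law and a flat one.
freqs = [j / 1024 for j in range(1, 101)]
gamma_one_over_f = fit_power_law_exponent(freqs, [2.0 / f for f in freqs])
gamma_flat = fit_power_law_exponent(freqs, [0.5 for _ in freqs])
```

A fitted exponent near 1 over a frequency range corresponds
to $1/f$ noise there, while an exponent near 0 corresponds
to a flat, white-noise-like region.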



B. Autocorrelation Function 

The autocorrelation function $R(l)$ of a numerical sequence
is, by definition,

$$R(l) = \langle x_k x_{k+l} \rangle, \tag{3}$$

where the brackets denote the average over the sites along
the chain. For $l = 0$, Eq. (3) implies
$R(0) = \langle x_k^2 \rangle$, which is a quantity
carrying no information about the relative positions of the
nucleotides. As in the case of $S(f_0)$ for the power
spectrum, this quantity will be neglected in our
calculations.
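
Eq. (3) can be evaluated directly from its definition, as in
the sketch below. Averaging over the N - l pairs that fit
inside the chain (open boundaries) is an assumption made for
this illustration, and the function name and toy sequence are
hypothetical.

```python
def autocorrelation(x, lag):
    """R(l) = <x_k x_{k+l}>, averaged over the pairs that fit in the chain."""
    n_pairs = len(x) - lag
    if lag < 0 or n_pairs <= 0:
        raise ValueError("lag must satisfy 0 <= lag < len(x)")
    return sum(x[k] * x[k + lag] for k in range(n_pairs)) / n_pairs


# Hypothetical indicator sequence with the symbol at every third site.
x = [1, 0, 0] * 4
baseline = (sum(x) / len(x)) ** 2  # <x_k>^2, the value expected for independent sites
r = {lag: autocorrelation(x, lag) for lag in range(1, 7)}
```

For this strictly periodic toy input R(l) never settles at
$\langle x_k \rangle^2$: it is 0 at lags that are not
multiples of 3 and 1/3 at lags that are.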

Statistical independence between sites separated by a
distance $l$ implies that
$\langle x_k x_{k+l} \rangle = \langle x_k \rangle^2$. The
value of $l$ above which this condition is satisfied (on
average) is called the correlation length. DNA molecules,
depending on the organism, can form an open or a closed
loop. Bacterial DNA usually forms a closed loop. For
circular chains, the autocorrelation function and the PS
form a Fourier transform pair (this is the Wiener-Khintchine
theorem). In order to consider the entire DNA sequence
(without the constraints of the Fast Fourier Transform
algorithm) we calculate the AF using its plain definition,
that is, Eq. (3), and not by Fourier transforming the PS
[16]. We present results for $l$ in the interval
$[1, 10^5]$. This is a much larger interval than the ones
considered in previous publications. It is clear that when
$l \ll N$, as is the case here, it does not matter whether
we consider open or closed boundary conditions. Since we
find it computationally easier to consider open boundary
conditions, we present the results of the AF for this case.
It is beyond the scope of this paper to study
cross-correlations between two different kinds of
nucleotides. Such a study can be found elsewhere.
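
The Fourier-pair relation for circular chains can be checked
numerically on a short sequence. The sketch below is an
illustration only: the function names and the sixteen-site
toy sequence are hypothetical, the PS uses the 1/N
normalization of Eq. (1), and the sums are evaluated directly
rather than with a fast transform.

```python
import cmath


def circular_autocorrelation(x, lag):
    """R(l) for a closed loop: the index k + l wraps around the chain."""
    n = len(x)
    return sum(x[k] * x[(k + lag) % n] for k in range(n)) / n


def power_spectrum(x):
    """|V(f_j)|^2 with V(f_j) = (1/N) sum_k x_k exp(-2 pi i k j / N)."""
    n = len(x)
    return [
        abs(sum(x[k] * cmath.exp(-2j * cmath.pi * k * j / n)
                for k in range(n)) / n) ** 2
        for j in range(n)
    ]


def autocorrelation_from_ps(spectrum, lag):
    """Inverse transform of the PS: sum_j S(f_j) exp(+2 pi i j l / N)."""
    n = len(spectrum)
    total = sum(spectrum[j] * cmath.exp(2j * cmath.pi * j * lag / n)
                for j in range(n))
    return total.real


# Hypothetical 0/1 sequence, used only to compare the two routes.
x = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]
spectrum = power_spectrum(x)
largest_gap = max(
    abs(circular_autocorrelation(x, lag) - autocorrelation_from_ps(spectrum, lag))
    for lag in range(len(x))
)
```

The two routes agree to rounding error for closed boundary
conditions; with open boundaries they differ by terms that
become negligible when $l \ll N$.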

We show in Fig. 3 the AF versus $l$ for the sequences whose
PS we displayed in Fig. 2. Since the AF presents a strong
oscillation of period 3, we chose the window size n to be a
multiple of 3 in order to smooth it out. Here we have used
n = 33 (there was no particular reason for choosing n a
multiple of 3 in the calculation of the PS). In Fig. 3 the
horizontal lines are the corresponding values of
$\langle x_k \rangle^2$. When
$R(l) = \langle x_k x_{k+l} \rangle \approx \langle x_k \rangle^2$,
statistical independence between the nucleotides of a given
type holds. As Fig. 3 shows, when $l < 100$ the AF is
roughly flat for some sequences, and for others it is
approximately a power-law. Then, as $l$ increases, we see a
power-law regime in all cases. Within the interval of $l$
studied, we observe that the AF can become flat again as
$l$ increases further (with
$R(l) \approx \langle x_k \rangle^2$), or not reach a
plateau at all. For the sequences whose PS flattens off at
low frequencies, we expect the AF to become flat at larger
$l$, with statistical independence holding. However, for
most of the cases studied, this happens when $l \gg 10^5$.
Only the AF of Aquifex aeolicus seems to reach a plateau
for $l$ in the interval $[1, 10^5]$ for all the
nucleotides. This is shown in Fig. 3(c) and Fig. 3(d),
where we observe that the correlation lengths for this
organism appear to be between $10^3$ and $10^4$. For the
other organisms studied, we see a wide variety of behaviors
of the AF at large $l$. As Fig. 3 shows, we find cases in
which the AF reaches a plateau with statistical
independence between the nucleotides; in others we see a
slow decrease of the AF, such as the AF of A for Bacillus
subtilis. We also find an abrupt change of slope in a
plateau region, as in the AF of A and of G in Haemophilus
influenzae Rd. And, most interestingly, we find the
presence of anti-correlations, that is,
$\langle x_k x_{k+l} \rangle$ being smaller than
$\langle x_k \rangle^2$. This implies that sites separated
by a given distance tend to be occupied by different
nucleotides. The case in which this appears most strongly
is the AF of C for Haemophilus influenzae Rd. We have also
observed that most of the sequences present a peak in the
AF at $l \approx 100$. The reason for this is unknown to
us.
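
The reason a window whose size is a multiple of 3 removes the
period-3 oscillation can be seen in a small sketch: every
window then contains whole periods only. The data below are
synthetic, with a level and an oscillation amplitude chosen
arbitrarily for illustration, and the function name is
hypothetical.

```python
def moving_average(values, n):
    """Mean of every run of n consecutive values."""
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]


# Synthetic AF: a constant level plus a pure period-3 oscillation.
# The numbers are made up for illustration and are not taken from the paper.
level = 0.06
raw_af = [level + (0.010 if lag % 3 == 0 else -0.005) for lag in range(1, 301)]

smooth_33 = moving_average(raw_af, 33)  # window size is a multiple of 3
smooth_32 = moving_average(raw_af, 32)  # window size is not a multiple of 3

ripple_33 = max(smooth_33) - min(smooth_33)
ripple_32 = max(smooth_32) - min(smooth_32)
```

With n = 33 the residual ripple is at rounding-error level,
whereas with n = 32 a period-3 ripple of reduced amplitude
survives the smoothing.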

III. CONCLUSION 

In summary, we have studied statistical properties of the
complete DNA of thirteen microbial genomes and shown that
its fractal behavior does not always prevail through the
entire chain. For some sequences the power spectrum becomes
flat at low frequencies, and for others it remains a
power-law. In the study of the autocorrelation function we
have found a rich variety of behaviors, including the
presence of anti-correlations.

FIG. 3. Autocorrelation function of Ecoli for (a) A and T,
(b) C and G; Aquifex aeolicus for (c) A and T, (d) C and G;
Bacillus subtilis for (e) A and T, (f) C and G; and
Haemophilus influenzae Rd for (g) A and T, (h) C and G.
The horizontal lines are $\langle x_k \rangle^2$.
Statistical linear independence between the nucleotides in
a given sequence occurs when
$R = \langle x_k x_{k+l} \rangle = \langle x_k \rangle^2$.